
Conversation


@ARF1 ARF1 commented Mar 6, 2015

Speedup of ca. 1.5x vs. the master branch on my machine with compressed bcolz.
Approximate contributions (a sketch of the first and third points follows below):

- 1/3: direct indexing into arrays using typed memoryviews
- 1/3: substitution of the reverse dict with std::vector objects
- 1/3: use of the nested `with nogil`, `with gil` construct

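To make the first and third points concrete, here is a minimal Cython sketch. It is not the bquery code: the function, its toy workload, and the `progress` list are invented for illustration. It shows direct C-level indexing through typed memoryviews, and a `with nogil` hot loop that re-acquires the GIL only for occasional Python-level work:

```
import numpy as np

cimport cython
cimport numpy as cnp


@cython.boundscheck(False)
@cython.wraparound(False)
def add_one_chunked(cnp.int64_t[:] values, list progress, Py_ssize_t chunk=4096):
    """Toy example: add 1 to every element, reporting progress per chunk."""
    cdef Py_ssize_t i, n = values.shape[0]
    cdef cnp.int64_t[:] out = np.empty(n, dtype=np.int64)

    with nogil:                      # hot loop runs without the GIL
        for i in range(n):
            out[i] = values[i] + 1   # typed memoryview: direct C-level indexing
            if i > 0 and i % chunk == 0:
                with gil:            # re-acquire the GIL only for rare Python work
                    progress.append(i)
    return np.asarray(out)
```

The point of the nesting is that the expensive per-element work stays GIL-free, while the rare Python-level interaction remains correct.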
ARF1 (Author) commented Mar 6, 2015

Uncompressed bcolz timings on my machine:

```
bquery master:
In [3]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 2.46 s per loop

pull request:
In [3]: %timeit -r 10 a.cache_factor(['isin'], refresh=True)
1 loops, best of 10: 1.59 s per loop

==> Factor: 1.5
```

Compressed bcolz timings on my machine:

```
bquery master:
In [3]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 4.03 s per loop

pull request:
In [3]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 3.13 s per loop

==> Factor: 1.3
```

Possible additional optimizations (probably fairly minor):

- There should be no need to store reverse_keys in _factorize_str_helper, since it is merely an increasing sequence of integers up to reverse_values.size - 1. (I wanted to keep the code logic as close to the original as possible.) A sketch of this idea follows the list.
- In factorize_str, the reverse Python dictionary (a hash table with expensive insertion, I think) is created only to be thrown away after the creation of carray_values. (Changing this would have obvious knock-on effects on other helper functions.)
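As a rough sketch of the first point (hypothetical names, not the actual _factorize_str_helper; it assumes Cython's standard automatic conversion between bytes and std::string): because factor codes are assigned as 0, 1, 2, ..., the reverse mapping from code to value is just positional indexing into a std::vector, so no separate reverse_keys container is needed:

```
# distutils: language = c++
from libcpp.vector cimport vector
from libcpp.string cimport string

cdef class ReverseValues:
    # index into the vector == factor code, so no reverse_keys needed
    cdef vector[string] values

    cpdef size_t add(self, bytes value):
        # append a newly seen value; its code is simply its position
        self.values.push_back(value)
        return self.values.size() - 1

    cpdef bytes get(self, size_t code):
        # O(1) positional lookup instead of a dict lookup
        return self.values[code]
```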

ARF1 pushed a commit to ARF1/bquery that referenced this pull request Mar 6, 2015
Shared variables with manual locking:
- hash table
- count
- reverse_keys
- reverse_values
- out_buffer
- chunk_

Shared variables without locking requirement:
- locks

Thread-local variables:
- thread_id
- in_buffer_ptr (points to thread-local buffer)
- out_buffer_ptr (points to thread-local buffer)

Locking scheme (a simplified sketch follows below):
- For each thread, a lock on the hash table (and the other associated shared variables) exists.
- Each thread processing a chunk begins by acquiring its own lock on the shared hash table.
- The lock is released when the thread encounters a value that is new to the hash table.
- Once the thread is ready to write to the hash table, it waits to acquire the locks of all threads.
- After the write, all locks are released.
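A simplified, hypothetical sketch of this scheme, using Python threading.Lock objects for readability; the actual code presumably uses lower-level locks, and holds the per-thread lock for a whole chunk rather than per lookup:

```
import threading

class SharedTable:
    # Hypothetical names; per-thread locks guard one shared hash table.
    def __init__(self, num_threads):
        self.locks = [threading.Lock() for _ in range(num_threads)]
        self.table = {}

    def lookup_or_insert(self, thread_id, key):
        my_lock = self.locks[thread_id]
        my_lock.acquire()            # reads only need the thread's own lock
        code = self.table.get(key)
        my_lock.release()
        if code is not None:
            return code
        # write path: acquire *all* locks (in a fixed order, avoiding
        # deadlock) so that no other thread is mid-lookup during the write
        for lock in self.locks:
            lock.acquire()
        try:
            # setdefault re-checks: another thread may have inserted the
            # key between releasing our lock and acquiring all of them
            code = self.table.setdefault(key, len(self.table))
        finally:
            for lock in self.locks:
                lock.release()
        return code
```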

---
Uncompressed bcolz timings:
```
--- uncached unique() ---
pandas (in-memory):
In [10]: %timeit -r 10 c.unique()
1 loops, best of 10: 881 ms per loop

bquery master over bcolz (persistent):
In [12]: %timeit -r 10 a.unique('mycol')
1 loops, best of 10: 2.1 s per loop
==> x2.38 slower than pandas

pull request over bcolz (persistent):
In [8]: %timeit -r 10 a.unique('mycol')
1 loops, best of 10: 834 ms per loop
==> x1.05 FASTER than pandas

---- cache_factor ---
bquery master over bcolz (persistent):
In [3]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 2.51 s per loop

pull request with 2 threads over bcolz (persistent):
In [3]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 1.16 s per loop
==> x2.16 faster than master

pull request with 1 thread over bcolz (persistent):
In [3]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 1.69 s per loop
==> x1.48 faster than master (c.f. x1.48 from single-threaded PR visualfabriq#21)
==> parallel code seems to have no performance penalty on single-core machines
```

Compressed bcolz timings:
```
--- uncached unique() ---
pandas (in-memory):
In [10]: %timeit -r 10 c.unique()
1 loops, best of 10: 881 ms per loop

bquery master over bcolz (persistent):
In [12]: %timeit -r 10 a.unique('mycol')
1 loops, best of 10: 3.39 s per loop
==> x3.85 slower than pandas

pull request over bcolz (persistent):
In [8]: %timeit -r 10 a.unique('mycol')
1 loops, best of 10: 1.9 s per loop
==> x2.16 slower than pandas

---- cache_factor ---
bquery master over bcolz (persistent):
In [5]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 4.09 s per loop

pull request with 2 threads over bcolz (persistent):
In [5]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 2.48 s per loop
==> x1.65 faster than master

pull request with 1 thread over bcolz (persistent):
In [5]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 3.26 s per loop
==> x1.25 faster than master (c.f. x1.28 from single-threaded PR visualfabriq#21)
```
FrancescElies pushed a commit to FrancescElies/bquery that referenced this pull request Mar 16, 2015 (same commit message as above).